SUMMER OLYMPICS ANALYSIS FROM 1896 TO 2016¶

The "Summer Olympics Analysis from 1896 to 2016" involves a comprehensive examination of the data related to the Summer Olympic Games over this period. This analysis can cover various aspects, including participation, performance, trends, and insights into the evolution of the games. Below are some key areas of analysis:

In [1]:
import warnings
warnings.filterwarnings('ignore')
In [2]:
import numpy as np
import pandas as pd

About Data¶

In [3]:
df=pd.read_csv('athlete_events.csv')  # Data set from kaggle
region_df=pd.read_csv('noc_regions.csv')
In [4]:
df.tail()
Out[4]:
ID Name Sex Age Height Weight Team NOC Games Year Season City Sport Event Medal
271111 135569 Andrzej ya M 29.0 179.0 89.0 Poland-1 POL 1976 Winter 1976 Winter Innsbruck Luge Luge Mixed (Men)'s Doubles NaN
271112 135570 Piotr ya M 27.0 176.0 59.0 Poland POL 2014 Winter 2014 Winter Sochi Ski Jumping Ski Jumping Men's Large Hill, Individual NaN
271113 135570 Piotr ya M 27.0 176.0 59.0 Poland POL 2014 Winter 2014 Winter Sochi Ski Jumping Ski Jumping Men's Large Hill, Team NaN
271114 135571 Tomasz Ireneusz ya M 30.0 185.0 96.0 Poland POL 1998 Winter 1998 Winter Nagano Bobsleigh Bobsleigh Men's Four NaN
271115 135571 Tomasz Ireneusz ya M 34.0 185.0 96.0 Poland POL 2002 Winter 2002 Winter Salt Lake City Bobsleigh Bobsleigh Men's Four NaN
In [5]:
df.shape
Out[5]:
(271116, 15)
In [6]:
df = df[df['Season'] == 'Summer']
In [7]:
df.shape
Out[7]:
(222552, 15)
In [8]:
region_df.tail()
Out[8]:
NOC region notes
225 YEM Yemen NaN
226 YMD Yemen South Yemen
227 YUG Serbia Yugoslavia
228 ZAM Zambia NaN
229 ZIM Zimbabwe NaN
In [9]:
region_df.shape
Out[9]:
(230, 3)
In [10]:
df = df.merge(region_df,on='NOC',how='left')
In [11]:
df.columns
Out[11]:
Index(['ID', 'Name', 'Sex', 'Age', 'Height', 'Weight', 'Team', 'NOC', 'Games',
       'Year', 'Season', 'City', 'Sport', 'Event', 'Medal', 'region', 'notes'],
      dtype='object')

Columns Deitals¶

    1. ID

Description: This is a unique identifier for each athlete in the dataset. It helps in tracking individual records and ensuring data accuracy. Use: Can be used to differentiate between athletes with the same name or other similar attributes.

    1. Name

Description: The full name of the athlete. Use: Useful for identifying specific athletes and could be used in analyses focused on the performance of individual athletes or for visualization purposes.

    1. Sex

Description: Gender of the athlete (Male/Female). Use: Crucial for gender-based analysis, such as studying gender participation trends or comparing performances between male and female athletes.

    1. Age

Description: Age of the athlete during the Olympic event. Use: Useful for analyzing the age distribution of athletes, identifying trends in the ages of medal winners, or studying the impact of age on performance.

    1. Height

Description: Height of the athlete in centimeters. Use: Can be used to analyze the relationship between height and performance in different sports or events.

    1. Weight

Description: Weight of the athlete in kilograms. Use: Similar to height, this can be used to study the correlation between an athlete's weight and their performance, especially in sports where physical attributes are critical.

    1. Team

Description: The team or country the athlete represented. Use: Essential for country-based analysis, such as comparing the performance of different countries or studying the geographic distribution of medals.

    1. NOC

Description: National Olympic Committee code, a three-letter abbreviation representing the athlete's country. Use: Used to identify the country in a standardized format, helpful in aggregating and comparing data across countries.

    1. Games

Description: The specific Olympic Games in which the athlete participated (e.g., "2000 Summer"). Use: Useful for filtering data based on specific Olympic Games, analyzing trends over time, or comparing performances across different Olympics.

    1. Year

Description: The year in which the Olympic Games took place. Use: Allows for time-series analysis, such as studying trends across different years or the evolution of specific sports.

    1. Season

Description: Indicates whether the event is part of the Summer or Winter Olympics. Since your project is about the Summer Olympics, this will likely always be "Summer." Use: While this may not vary in your dataset, it's useful for filtering and confirming that all data pertains to the correct season.

    1. City

Description: The city where the Olympics were held. Use: Useful for location-based analysis, such as comparing the performance of athletes in different host cities or studying the impact of hosting on local athletes.

    1. Sport

Description: The broader category of sport in which the athlete competed (e.g., "Athletics"). Use: Allows for analysis at the sport level, such as comparing participation rates, medal counts, or trends within specific sports.

    1. Event

Description: The specific event within the sport (e.g., "100m sprint"). Use: Useful for detailed analysis within sports, such as comparing performances across different events or identifying events with the highest competition.

    1. Medal

Description: Indicates the type of medal won by the athlete (Gold, Silver, Bronze, or None). Use: Central to medal analysis, including identifying top-performing countries, athletes, or sports, and studying trends in medal distribution.

    1. Region

Description: The broader region associated with the athlete's country (e.g., "Europe"). Use: Enables regional analysis, such as comparing performance across different regions or studying the impact of regional factors on athletic success.

    1. Notes

Description: Any additional information or remarks related to the athlete or event. Use: This column may contain miscellaneous data that could be useful for specific cases or deeper insights.

Column wise Analysis¶

1. ID¶

In [12]:
df['ID'].value_counts().sum()
Out[12]:
222552

The 'ID' column is a unique identifier for each athlete, but it doesn't provide any meaningful information for analysis. It doesn't contribute to understanding trends, patterns, or relationships within the data. So lets drop it

In [13]:
df.drop(['ID'],axis=1,inplace=True)
In [14]:
df.columns
Out[14]:
Index(['Name', 'Sex', 'Age', 'Height', 'Weight', 'Team', 'NOC', 'Games',
       'Year', 'Season', 'City', 'Sport', 'Event', 'Medal', 'region', 'notes'],
      dtype='object')

2. Name¶

In [15]:
df['Name'].value_counts().sum()
Out[15]:
222552
In [16]:
df['Name'].isnull().sum()
Out[16]:
0

3. Sex¶

In [17]:
df['Sex'].value_counts()
Out[17]:
Sex
M    163109
F     59443
Name: count, dtype: int64
In [18]:
df['Sex'].isnull().sum()
Out[18]:
0

4. Age¶

In [19]:
df['Age'].value_counts()
Out[19]:
Age
23.0    17521
24.0    17200
22.0    16852
25.0    15733
21.0    15519
        ...  
77.0        2
84.0        1
96.0        1
10.0        1
97.0        1
Name: count, Length: 74, dtype: int64
In [20]:
df['Age'].unique()
Out[20]:
array([24., 23., 34., 18., 26., 31., 30., 22., 28., 32., 54., 27., 43.,
       47., 21., 25., 20., 33., 41., 45., 49., 53., 57., nan, 19., 29.,
       38., 35., 16., 37., 42., 46., 40., 15., 36., 17., 14., 39., 48.,
       52., 55., 50., 44., 71., 63., 51., 58., 60., 75., 65., 13., 56.,
       64., 68., 84., 12., 72., 59., 61., 70., 74., 62., 67., 69., 73.,
       66., 76., 88., 11., 96., 80., 10., 81., 77., 97.])
In [21]:
df['Age'].isnull().sum()
Out[21]:
9189

5. Height¶

In [22]:
df['Height'].value_counts()
Out[22]:
Height
170.0    9696
180.0    9589
175.0    8192
178.0    8162
168.0    6588
         ... 
226.0       3
131.0       2
130.0       2
219.0       2
128.0       1
Name: count, Length: 95, dtype: int64
In [23]:
df['Height'].unique()
Out[23]:
array([180., 170.,  nan, 168., 172., 159., 175., 189., 163., 187., 174.,
       194., 182., 185., 177., 167., 176., 164., 198., 165., 156., 179.,
       178., 166., 186., 190., 183., 181., 188., 155., 169., 160., 173.,
       171., 192., 195., 184., 191., 203., 158., 154., 153., 152., 193.,
       162., 197., 157., 207., 147., 150., 202., 161., 210., 205., 200.,
       148., 208., 140., 213., 204., 151., 196., 146., 201., 132., 149.,
       199., 211., 215., 139., 206., 135., 142., 217., 209., 145., 221.,
       144., 214., 143., 218., 127., 223., 216., 141., 212., 131., 136.,
       137., 138., 220., 130., 219., 133., 128., 226.])
In [24]:
df['Height'].isnull().sum()
Out[24]:
51857

6.Weight¶

In [25]:
df['Weight'].value_counts()
Out[25]:
Weight
70.0     7659
60.0     6389
75.0     6298
68.0     5794
65.0     5735
         ... 
149.0       1
180.0       1
48.5        1
190.0       1
151.0       1
Name: count, Length: 219, dtype: int64
In [26]:
df['Weight'].unique().shape
Out[26]:
(220,)
In [27]:
df['Weight'].isnull().sum()
Out[27]:
53854

7.Team¶

In [28]:
df['Team'].value_counts()
Out[28]:
Team
United States    14526
Great Britain    10309
France           10021
Italy             8051
Germany           7378
                 ...  
Souvenance           1
Crabe I-11           1
Crabe I-3            1
Crabe I-2            1
Digby                1
Name: count, Length: 1157, dtype: int64
In [29]:
df['Team'].unique().shape
Out[29]:
(1157,)
In [30]:
df['Team'].isnull().sum()
Out[30]:
0

8.NOC¶

In [31]:
df['NOC'].value_counts()
Out[31]:
NOC
USA    15064
GBR    10917
FRA    10633
ITA     8217
GER     7622
       ...  
YMD        5
SSD        3
NBO        2
UNK        2
NFL        1
Name: count, Length: 230, dtype: int64
In [32]:
df['NOC'].unique().shape
Out[32]:
(230,)
In [33]:
df['NOC'].isnull().sum()
Out[33]:
0

9.Games¶

In [34]:
df['Games'].value_counts()
Out[34]:
Games
2000 Summer    13821
1996 Summer    13780
2016 Summer    13688
2008 Summer    13602
2004 Summer    13443
1992 Summer    12977
2012 Summer    12920
1988 Summer    12037
1972 Summer    10304
1984 Summer     9454
1976 Summer     8641
1968 Summer     8588
1952 Summer     8270
1960 Summer     8119
1964 Summer     7702
1980 Summer     7191
1936 Summer     6506
1948 Summer     6405
1924 Summer     5233
1956 Summer     5127
1928 Summer     4992
1920 Summer     4292
1912 Summer     4040
1908 Summer     3101
1932 Summer     2969
1900 Summer     1936
1906 Summer     1733
1904 Summer     1301
1896 Summer      380
Name: count, dtype: int64
In [35]:
df['Games'].unique().shape
Out[35]:
(29,)
In [36]:
df['Games'].isnull().sum()
Out[36]:
0

10.Year¶

In [37]:
df['Year'].value_counts()
Out[37]:
Year
2000    13821
1996    13780
2016    13688
2008    13602
2004    13443
1992    12977
2012    12920
1988    12037
1972    10304
1984     9454
1976     8641
1968     8588
1952     8270
1960     8119
1964     7702
1980     7191
1936     6506
1948     6405
1924     5233
1956     5127
1928     4992
1920     4292
1912     4040
1908     3101
1932     2969
1900     1936
1906     1733
1904     1301
1896      380
Name: count, dtype: int64
In [38]:
df['Year'].unique().shape
Out[38]:
(29,)
In [39]:
df['Year'].isnull().sum()
Out[39]:
0

11.Seasion¶

In [40]:
df['Season'].value_counts()
Out[40]:
Season
Summer    222552
Name: count, dtype: int64
In [41]:
#As we know the that we are working in only Summer Olympics data, we can drop this column
df.drop(columns=['Season'],axis=1,inplace=True)
In [42]:
df.columns
Out[42]:
Index(['Name', 'Sex', 'Age', 'Height', 'Weight', 'Team', 'NOC', 'Games',
       'Year', 'City', 'Sport', 'Event', 'Medal', 'region', 'notes'],
      dtype='object')

12. City¶

In [43]:
df['City'].value_counts()
Out[43]:
City
London            22426
Athina            15556
Sydney            13821
Atlanta           13780
Rio de Janeiro    13688
Beijing           13602
Barcelona         12977
Los Angeles       12423
Seoul             12037
Munich            10304
Montreal           8641
Mexico City        8588
Helsinki           8270
Roma               8119
Tokyo              7702
Moskva             7191
Paris              7169
Berlin             6506
Amsterdam          4992
Melbourne          4829
Stockholm          4338
Antwerpen          4292
St. Louis          1301
Name: count, dtype: int64
In [44]:
df['City'].unique().shape
Out[44]:
(23,)
In [45]:
df['City'].isnull().sum()
Out[45]:
0

13. Sport¶

In [46]:
df['Sport'].value_counts()
Out[46]:
Sport
Athletics                38624
Gymnastics               26707
Swimming                 23195
Shooting                 11448
Cycling                  10859
Fencing                  10735
Rowing                   10595
Wrestling                 7154
Football                  6745
Sailing                   6586
Equestrianism             6344
Canoeing                  6171
Boxing                    6047
Hockey                    5417
Basketball                4536
Weightlifting             3937
Water Polo                3846
Judo                      3801
Handball                  3665
Art Competitions          3578
Volleyball                3404
Tennis                    2862
Diving                    2842
Archery                   2334
Table Tennis              1955
Modern Pentathlon         1677
Badminton                 1457
Synchronized Swimming      909
Baseball                   894
Rhythmic Gymnastics        658
Taekwondo                  606
Beach Volleyball           564
Triathlon                  529
Softball                   478
Rugby Sevens               299
Golf                       247
Tug-Of-War                 170
Rugby                      162
Trampolining               152
Polo                        95
Lacrosse                    60
Ice Hockey                  60
Figure Skating              54
Cricket                     24
Croquet                     19
Motorboating                17
Racquets                    12
Jeu De Paume                11
Roque                        4
Alpinism                     4
Basque Pelota                2
Aeronautics                  1
Name: count, dtype: int64
In [47]:
df['Sport'].unique().shape
Out[47]:
(52,)
In [48]:
df['Sport'].isnull().sum()
Out[48]:
0

14. Event¶

In [49]:
df['Event'].value_counts()
Out[49]:
Event
Football Men's Football                                5733
Hockey Men's Hockey                                    3958
Water Polo Men's Water Polo                            3358
Basketball Men's Basketball                            3280
Cycling Men's Road Race, Individual                    2947
                                                       ... 
Archery Men's Target Archery, 50 metres, Individual       2
Sailing Mixed 18 foot                                     2
Croquet Mixed Doubles                                     2
Basque Pelota Men's Two-Man Teams With Cesta              2
Aeronautics Mixed Aeronautics                             1
Name: count, Length: 651, dtype: int64
In [50]:
df['Event'].unique().shape
Out[50]:
(651,)
In [51]:
df['Event'].isnull().sum()
Out[51]:
0

15.Medal¶

In [52]:
df['Medal'].value_counts()
Out[52]:
Medal
Gold      11459
Bronze    11409
Silver    11220
Name: count, dtype: int64
In [53]:
df['Medal'].unique().shape
Out[53]:
(4,)
In [54]:
df['Medal'].isnull().sum()
Out[54]:
188464

16.region¶

In [55]:
df['region'].value_counts()
Out[55]:
region
USA            15064
Germany        12377
UK             10917
France         10633
Russia          8855
               ...  
Kiribati          11
Brunei            10
Timor-Leste        8
Kosovo             8
South Sudan        3
Name: count, Length: 205, dtype: int64
In [56]:
df['region'].unique().shape
Out[56]:
(206,)
In [57]:
df['region'].isnull().sum()
Out[57]:
370

17. notes¶

In [58]:
df['notes'].value_counts()
Out[58]:
notes
Yugoslavia                     2046
Hong Kong                       673
Trinidad and Tobago             368
Serbia and Montenegro           300
Virgin Islands                  239
Bohemia                         153
Antigua and Barbuda             133
United Arab Republic            123
Individual Olympic Athletes      93
Australasia                      86
Netherlands Antilles             74
Turks and Caicos Islands         42
West Indies Federation           20
Refugee Olympic Team             12
North Yemen                      11
Crete                            11
Tuvalu                            7
South Yemen                       5
North Borneo                      2
Unknown                           2
Newfoundland                      1
Name: count, dtype: int64
In [59]:
df['notes'].unique().shape
Out[59]:
(22,)
In [60]:
df['notes'].isnull().sum()
Out[60]:
218151

Medal Tally¶

In [61]:
df['region'].unique().shape
Out[61]:
(206,)
In [62]:
df.isnull().sum()
Out[62]:
Name           0
Sex            0
Age         9189
Height     51857
Weight     53854
Team           0
NOC            0
Games          0
Year           0
City           0
Sport          0
Event          0
Medal     188464
region       370
notes     218151
dtype: int64
  • As hear we see their are no missing values in NOC columns and thair are 370 missing values in region columns,It is due thair are some columns where NOC is not converted to region in the dataset
In [63]:
df.duplicated().sum()
Out[63]:
1385
In [64]:
df.drop_duplicates(inplace=True)
In [65]:
df.duplicated().sum()
Out[65]:
0
In [66]:
df['Medal'].value_counts()
Out[66]:
Medal
Gold      11456
Bronze    11409
Silver    11212
Name: count, dtype: int64
In [67]:
df['Medal'].unique()
Out[67]:
array([nan, 'Gold', 'Bronze', 'Silver'], dtype=object)
In [68]:
pd.get_dummies(df['Medal'])
Out[68]:
Bronze Gold Silver
0 False False False
1 False False False
2 False False False
3 False True False
4 False False False
... ... ... ...
222547 False False False
222548 False False False
222549 False False False
222550 False False False
222551 False False False

221167 rows × 3 columns

In [69]:
# concatinate with original dataframe
df=pd.concat([df,pd.get_dummies(df['Medal'])],axis=1) # One hot incoding
In [70]:
df.dtypes
Out[70]:
Name       object
Sex        object
Age       float64
Height    float64
Weight    float64
Team       object
NOC        object
Games      object
Year        int64
City       object
Sport      object
Event      object
Medal      object
region     object
notes      object
Bronze       bool
Gold         bool
Silver       bool
dtype: object
In [71]:
df[['Bronze', 'Gold', 'Silver']] = df[['Bronze', 'Gold', 'Silver']].astype(int)
In [72]:
df.head()
Out[72]:
Name Sex Age Height Weight Team NOC Games Year City Sport Event Medal region notes Bronze Gold Silver
0 A Dijiang M 24.0 180.0 80.0 China CHN 1992 Summer 1992 Barcelona Basketball Basketball Men's Basketball NaN China NaN 0 0 0
1 A Lamusi M 23.0 170.0 60.0 China CHN 2012 Summer 2012 London Judo Judo Men's Extra-Lightweight NaN China NaN 0 0 0
2 Gunnar Nielsen Aaby M 24.0 NaN NaN Denmark DEN 1920 Summer 1920 Antwerpen Football Football Men's Football NaN Denmark NaN 0 0 0
3 Edgar Lindenau Aabye M 34.0 NaN NaN Denmark/Sweden DEN 1900 Summer 1900 Paris Tug-Of-War Tug-Of-War Men's Tug-Of-War Gold Denmark NaN 0 1 0
4 Cornelia "Cor" Aalten (-Strannood) F 18.0 168.0 NaN Netherlands NED 1932 Summer 1932 Los Angeles Athletics Athletics Women's 100 metres NaN Netherlands NaN 0 0 0
In [73]:
df.shape
Out[73]:
(221167, 18)
In [74]:
df.shape
Out[74]:
(221167, 18)
In [75]:
# groupby with NOC
df.groupby('NOC').sum()[['Gold','Silver','Bronze']].sort_values('Gold',ascending=False).reset_index()
Out[75]:
NOC Gold Silver Bronze
0 USA 2472 1333 1197
1 URS 832 635 596
2 GBR 635 729 620
3 GER 592 538 649
4 ITA 518 474 454
... ... ... ... ...
225 AHO 0 1 0
226 LBR 0 0 0
227 LCA 0 0 0
228 LES 0 0 0
229 LBA 0 0 0

230 rows × 4 columns

  • It is a wrong output because if a country win a gold medal in a perticular game that means its gold medal count is one,but if the perticular event has number of players then each players win gold medal each thats why number of gold medal is showing much more and its same for all medal.
  • Below one example for india
In [76]:
df[(df['NOC']=='IND') & (df['Medal'] == 'Gold')]
Out[76]:
Name Sex Age Height Weight Team NOC Games Year City Sport Event Medal region notes Bronze Gold Silver
4186 Shaukat Ali M 30.0 NaN NaN India IND 1928 Summer 1928 Amsterdam Hockey Hockey Men's Hockey Gold India NaN 0 1 0
4190 Syed Mushtaq Ali M 22.0 165.0 61.0 India IND 1964 Summer 1964 Tokyo Hockey Hockey Men's Hockey Gold India NaN 0 1 0
4460 Richard James Allen M 25.0 172.0 NaN India IND 1928 Summer 1928 Amsterdam Hockey Hockey Men's Hockey Gold India NaN 0 1 0
4461 Richard James Allen M 30.0 172.0 NaN India IND 1932 Summer 1932 Los Angeles Hockey Hockey Men's Hockey Gold India NaN 0 1 0
4462 Richard James Allen M 34.0 172.0 NaN India IND 1936 Summer 1936 Berlin Hockey Hockey Men's Hockey Gold India NaN 0 1 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
191759 Dung Dung Sylvanus M 31.0 160.0 62.0 India IND 1980 Summer 1980 Moskva Hockey Hockey Men's Hockey Gold India NaN 0 1 0
193964 Carlyle Carrol Tapsell M 23.0 182.0 NaN India IND 1932 Summer 1932 Los Angeles Hockey Hockey Men's Hockey Gold India NaN 0 1 0
193965 Carlyle Carrol Tapsell M 27.0 182.0 NaN India IND 1936 Summer 1936 Berlin Hockey Hockey Men's Hockey Gold India NaN 0 1 0
205949 Max "Maxie" Vaz M NaN NaN NaN India IND 1948 Summer 1948 London Hockey Hockey Men's Hockey Gold India NaN 0 1 0
219336 Sayed Muhammad Yusuf M NaN NaN NaN India IND 1928 Summer 1928 Amsterdam Hockey Hockey Men's Hockey Gold India NaN 0 1 0

131 rows × 18 columns

  • For Correct this we drop duplicates from following columns ['Team','NOC','Games','Year','City','Sport','Event','Medal']
In [77]:
medal_tally=df.drop_duplicates(subset=['Team','NOC','Games','Year','City','Sport','Event','Medal'])
In [78]:
medal_tally.groupby('NOC').sum()[['Gold','Silver','Bronze']].sort_values('Gold',ascending=False).reset_index()
Out[78]:
NOC Gold Silver Bronze
0 USA 1035 802 708
1 URS 394 317 294
2 GBR 278 317 300
3 GER 235 261 283
4 FRA 234 256 287
... ... ... ... ...
225 AHO 0 1 0
226 LBR 0 0 0
227 LCA 0 0 0
228 LES 0 0 0
229 LBA 0 0 0

230 rows × 4 columns

In [79]:
medal_tally[medal_tally['NOC'] == 'IND']
Out[79]:
Name Sex Age Height Weight Team NOC Games Year City Sport Event Medal region notes Bronze Gold Silver
436 S. Abdul Hamid M NaN NaN NaN India IND 1928 Summer 1928 Amsterdam Athletics Athletics Men's 110 metres Hurdles NaN India NaN 0 0 0
437 S. Abdul Hamid M NaN NaN NaN India IND 1928 Summer 1928 Amsterdam Athletics Athletics Men's 400 metres Hurdles NaN India NaN 0 0 0
790 Shiny Kurisingal Abraham-Wilson F 19.0 167.0 53.0 India IND 1984 Summer 1984 Los Angeles Athletics Athletics Women's 800 metres NaN India NaN 0 0 0
791 Shiny Kurisingal Abraham-Wilson F 19.0 167.0 53.0 India IND 1984 Summer 1984 Los Angeles Athletics Athletics Women's 4 x 400 metres Relay NaN India NaN 0 0 0
792 Shiny Kurisingal Abraham-Wilson F 23.0 167.0 53.0 India IND 1988 Summer 1988 Seoul Athletics Athletics Women's 800 metres NaN India NaN 0 0 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
216942 Vikas Krishan Yadav M 24.0 177.0 69.0 India IND 2016 Summer 2016 Rio de Janeiro Boxing Boxing Men's Middleweight NaN India NaN 0 0 0
216985 Mohammad Anas Yahiya M 21.0 177.0 69.0 India IND 2016 Summer 2016 Rio de Janeiro Athletics Athletics Men's 400 metres NaN India NaN 0 0 0
218411 Thyadathuvilla Chandrapillai "T. C." Yohannan M 29.0 174.0 62.0 India IND 1976 Summer 1976 Montreal Athletics Athletics Men's Long Jump NaN India NaN 0 0 0
222416 Geeta Zutshi F 23.0 167.0 51.0 India IND 1980 Summer 1980 Moskva Athletics Athletics Women's 800 metres NaN India NaN 0 0 0
222417 Geeta Zutshi F 27.0 167.0 51.0 India IND 1984 Summer 1984 Los Angeles Athletics Athletics Women's 3,000 metres NaN India NaN 0 0 0

687 rows × 18 columns

In [80]:
medal_tally = medal_tally.groupby('region').sum()[['Gold','Silver','Bronze']].sort_values('Gold',ascending=False).reset_index()
In [81]:
medal_tally['total'] = medal_tally['Gold'] + medal_tally['Silver'] + medal_tally['Bronze']
In [82]:
medal_tally
Out[82]:
region Gold Silver Bronze total
0 USA 1035 802 708 2545
1 Russia 592 498 487 1577
2 Germany 444 457 491 1392
3 UK 278 317 300 895
4 France 234 256 287 777
... ... ... ... ... ...
200 Lesotho 0 0 0 0
201 Albania 0 0 0 0
202 Libya 0 0 0 0
203 Liechtenstein 0 0 0 0
204 Liberia 0 0 0 0

205 rows × 5 columns

In Which Year Olympics was Played¶

In [83]:
years = df['Year'].unique().tolist()
In [84]:
years.sort()
In [85]:
years.insert(0,'Overall')
In [86]:
years
Out[86]:
['Overall',
 1896,
 1900,
 1904,
 1906,
 1908,
 1912,
 1920,
 1924,
 1928,
 1932,
 1936,
 1948,
 1952,
 1956,
 1960,
 1964,
 1968,
 1972,
 1976,
 1980,
 1984,
 1988,
 1992,
 1996,
 2000,
 2004,
 2008,
 2012,
 2016]
  • Till 2016 , Summer Olympics was played 29 times staring from 1896 to 2016

Which Country Participate in Summer Olympics¶

In [87]:
country = np.unique(df['region'].dropna().values).tolist()
In [88]:
country.sort()
In [89]:
country
Out[89]:
['Afghanistan',
 'Albania',
 'Algeria',
 'American Samoa',
 'Andorra',
 'Angola',
 'Antigua',
 'Argentina',
 'Armenia',
 'Aruba',
 'Australia',
 'Austria',
 'Azerbaijan',
 'Bahamas',
 'Bahrain',
 'Bangladesh',
 'Barbados',
 'Belarus',
 'Belgium',
 'Belize',
 'Benin',
 'Bermuda',
 'Bhutan',
 'Boliva',
 'Bosnia and Herzegovina',
 'Botswana',
 'Brazil',
 'Brunei',
 'Bulgaria',
 'Burkina Faso',
 'Burundi',
 'Cambodia',
 'Cameroon',
 'Canada',
 'Cape Verde',
 'Cayman Islands',
 'Central African Republic',
 'Chad',
 'Chile',
 'China',
 'Colombia',
 'Comoros',
 'Cook Islands',
 'Costa Rica',
 'Croatia',
 'Cuba',
 'Curacao',
 'Cyprus',
 'Czech Republic',
 'Democratic Republic of the Congo',
 'Denmark',
 'Djibouti',
 'Dominica',
 'Dominican Republic',
 'Ecuador',
 'Egypt',
 'El Salvador',
 'Equatorial Guinea',
 'Eritrea',
 'Estonia',
 'Ethiopia',
 'Fiji',
 'Finland',
 'France',
 'Gabon',
 'Gambia',
 'Georgia',
 'Germany',
 'Ghana',
 'Greece',
 'Grenada',
 'Guam',
 'Guatemala',
 'Guinea',
 'Guinea-Bissau',
 'Guyana',
 'Haiti',
 'Honduras',
 'Hungary',
 'Iceland',
 'India',
 'Individual Olympic Athletes',
 'Indonesia',
 'Iran',
 'Iraq',
 'Ireland',
 'Israel',
 'Italy',
 'Ivory Coast',
 'Jamaica',
 'Japan',
 'Jordan',
 'Kazakhstan',
 'Kenya',
 'Kiribati',
 'Kosovo',
 'Kuwait',
 'Kyrgyzstan',
 'Laos',
 'Latvia',
 'Lebanon',
 'Lesotho',
 'Liberia',
 'Libya',
 'Liechtenstein',
 'Lithuania',
 'Luxembourg',
 'Macedonia',
 'Madagascar',
 'Malawi',
 'Malaysia',
 'Maldives',
 'Mali',
 'Malta',
 'Marshall Islands',
 'Mauritania',
 'Mauritius',
 'Mexico',
 'Micronesia',
 'Moldova',
 'Monaco',
 'Mongolia',
 'Montenegro',
 'Morocco',
 'Mozambique',
 'Myanmar',
 'Namibia',
 'Nauru',
 'Nepal',
 'Netherlands',
 'New Zealand',
 'Nicaragua',
 'Niger',
 'Nigeria',
 'North Korea',
 'Norway',
 'Oman',
 'Pakistan',
 'Palau',
 'Palestine',
 'Panama',
 'Papua New Guinea',
 'Paraguay',
 'Peru',
 'Philippines',
 'Poland',
 'Portugal',
 'Puerto Rico',
 'Qatar',
 'Republic of Congo',
 'Romania',
 'Russia',
 'Rwanda',
 'Saint Kitts',
 'Saint Lucia',
 'Saint Vincent',
 'Samoa',
 'San Marino',
 'Sao Tome and Principe',
 'Saudi Arabia',
 'Senegal',
 'Serbia',
 'Seychelles',
 'Sierra Leone',
 'Slovakia',
 'Slovenia',
 'Solomon Islands',
 'Somalia',
 'South Africa',
 'South Korea',
 'South Sudan',
 'Spain',
 'Sri Lanka',
 'Sudan',
 'Suriname',
 'Swaziland',
 'Sweden',
 'Switzerland',
 'Syria',
 'Taiwan',
 'Tajikistan',
 'Tanzania',
 'Thailand',
 'Timor-Leste',
 'Togo',
 'Tonga',
 'Trinidad',
 'Tunisia',
 'Turkey',
 'Turkmenistan',
 'UK',
 'USA',
 'Uganda',
 'Ukraine',
 'United Arab Emirates',
 'Uruguay',
 'Uzbekistan',
 'Vanuatu',
 'Venezuela',
 'Vietnam',
 'Virgin Islands, British',
 'Virgin Islands, US',
 'Yemen',
 'Zambia',
 'Zimbabwe']
In [90]:
country.insert(0,'Overall')
In [91]:
country
Out[91]:
['Overall',
 'Afghanistan',
 'Albania',
 'Algeria',
 'American Samoa',
 'Andorra',
 'Angola',
 'Antigua',
 'Argentina',
 'Armenia',
 'Aruba',
 'Australia',
 'Austria',
 'Azerbaijan',
 'Bahamas',
 'Bahrain',
 'Bangladesh',
 'Barbados',
 'Belarus',
 'Belgium',
 'Belize',
 'Benin',
 'Bermuda',
 'Bhutan',
 'Boliva',
 'Bosnia and Herzegovina',
 'Botswana',
 'Brazil',
 'Brunei',
 'Bulgaria',
 'Burkina Faso',
 'Burundi',
 'Cambodia',
 'Cameroon',
 'Canada',
 'Cape Verde',
 'Cayman Islands',
 'Central African Republic',
 'Chad',
 'Chile',
 'China',
 'Colombia',
 'Comoros',
 'Cook Islands',
 'Costa Rica',
 'Croatia',
 'Cuba',
 'Curacao',
 'Cyprus',
 'Czech Republic',
 'Democratic Republic of the Congo',
 'Denmark',
 'Djibouti',
 'Dominica',
 'Dominican Republic',
 'Ecuador',
 'Egypt',
 'El Salvador',
 'Equatorial Guinea',
 'Eritrea',
 'Estonia',
 'Ethiopia',
 'Fiji',
 'Finland',
 'France',
 'Gabon',
 'Gambia',
 'Georgia',
 'Germany',
 'Ghana',
 'Greece',
 'Grenada',
 'Guam',
 'Guatemala',
 'Guinea',
 'Guinea-Bissau',
 'Guyana',
 'Haiti',
 'Honduras',
 'Hungary',
 'Iceland',
 'India',
 'Individual Olympic Athletes',
 'Indonesia',
 'Iran',
 'Iraq',
 'Ireland',
 'Israel',
 'Italy',
 'Ivory Coast',
 'Jamaica',
 'Japan',
 'Jordan',
 'Kazakhstan',
 'Kenya',
 'Kiribati',
 'Kosovo',
 'Kuwait',
 'Kyrgyzstan',
 'Laos',
 'Latvia',
 'Lebanon',
 'Lesotho',
 'Liberia',
 'Libya',
 'Liechtenstein',
 'Lithuania',
 'Luxembourg',
 'Macedonia',
 'Madagascar',
 'Malawi',
 'Malaysia',
 'Maldives',
 'Mali',
 'Malta',
 'Marshall Islands',
 'Mauritania',
 'Mauritius',
 'Mexico',
 'Micronesia',
 'Moldova',
 'Monaco',
 'Mongolia',
 'Montenegro',
 'Morocco',
 'Mozambique',
 'Myanmar',
 'Namibia',
 'Nauru',
 'Nepal',
 'Netherlands',
 'New Zealand',
 'Nicaragua',
 'Niger',
 'Nigeria',
 'North Korea',
 'Norway',
 'Oman',
 'Pakistan',
 'Palau',
 'Palestine',
 'Panama',
 'Papua New Guinea',
 'Paraguay',
 'Peru',
 'Philippines',
 'Poland',
 'Portugal',
 'Puerto Rico',
 'Qatar',
 'Republic of Congo',
 'Romania',
 'Russia',
 'Rwanda',
 'Saint Kitts',
 'Saint Lucia',
 'Saint Vincent',
 'Samoa',
 'San Marino',
 'Sao Tome and Principe',
 'Saudi Arabia',
 'Senegal',
 'Serbia',
 'Seychelles',
 'Sierra Leone',
 'Slovakia',
 'Slovenia',
 'Solomon Islands',
 'Somalia',
 'South Africa',
 'South Korea',
 'South Sudan',
 'Spain',
 'Sri Lanka',
 'Sudan',
 'Suriname',
 'Swaziland',
 'Sweden',
 'Switzerland',
 'Syria',
 'Taiwan',
 'Tajikistan',
 'Tanzania',
 'Thailand',
 'Timor-Leste',
 'Togo',
 'Tonga',
 'Trinidad',
 'Tunisia',
 'Turkey',
 'Turkmenistan',
 'UK',
 'USA',
 'Uganda',
 'Ukraine',
 'United Arab Emirates',
 'Uruguay',
 'Uzbekistan',
 'Vanuatu',
 'Venezuela',
 'Vietnam',
 'Virgin Islands, British',
 'Virgin Islands, US',
 'Yemen',
 'Zambia',
 'Zimbabwe']
In [92]:
medal_df = df.drop_duplicates(subset=['Team', 'NOC', 'Games', 'Year', 'City', 'Sport', 'Event', 'Medal'])
In [93]:
medal_df.head()
Out[93]:
Name Sex Age Height Weight Team NOC Games Year City Sport Event Medal region notes Bronze Gold Silver
0 A Dijiang M 24.0 180.0 80.0 China CHN 1992 Summer 1992 Barcelona Basketball Basketball Men's Basketball NaN China NaN 0 0 0
1 A Lamusi M 23.0 170.0 60.0 China CHN 2012 Summer 2012 London Judo Judo Men's Extra-Lightweight NaN China NaN 0 0 0
2 Gunnar Nielsen Aaby M 24.0 NaN NaN Denmark DEN 1920 Summer 1920 Antwerpen Football Football Men's Football NaN Denmark NaN 0 0 0
3 Edgar Lindenau Aabye M 34.0 NaN NaN Denmark/Sweden DEN 1900 Summer 1900 Paris Tug-Of-War Tug-Of-War Men's Tug-Of-War Gold Denmark NaN 0 1 0
4 Cornelia "Cor" Aalten (-Strannood) F 18.0 168.0 NaN Netherlands NED 1932 Summer 1932 Los Angeles Athletics Athletics Women's 100 metres NaN Netherlands NaN 0 0 0
In [94]:
def fetch_medal_tally(year, country):
    # Drop duplicates based on the relevant columns
    medal_df = df.drop_duplicates(subset=['Team', 'NOC', 'Games', 'Year', 'City', 'Sport', 'Event', 'Medal'])
    flag = 0
    
    # Filter the DataFrame based on the input year and country
    if year == 'Overall' and country == 'Overall':
        temp_df = medal_df
    if year == 'Overall' and country != 'Overall':
        flag = 1
        temp_df = medal_df[medal_df['region'] == country]
    if year != 'Overall' and country == 'Overall':
        temp_df = medal_df[medal_df['Year'] == int(year)]
    if year != 'Overall' and country != 'Overall':
        temp_df = medal_df[(medal_df['Year'] == 2016) & (medal_df['region'] == country)]
    
    # Group and summarize the data
    if flag == 1:
        x = temp_df.groupby('Year').sum()[['Gold', 'Silver', 'Bronze']].sort_values('Year').reset_index()
    else:
        x = temp_df.groupby('region').sum()[['Gold', 'Silver', 'Bronze']].sort_values('Gold', ascending=False).reset_index()
    
    # Calculate the total medals
    x['Total'] = x['Gold'] + x['Silver'] + x['Bronze']
    
    print(x)
In [95]:
fetch_medal_tally(year='Overall',country='USA')
    Year  Gold  Silver  Bronze  Total
0   1896    11       6       2     19
1   1900    21      16      17     54
2   1904    76      79      76    231
3   1906    12       5       6     23
4   1908    22      12      12     46
5   1912    26      18      19     63
6   1920    41      27      27     95
7   1924    45      27      27     99
8   1928    22      18      16     56
9   1932    44      36      30    110
10  1936    24      21      12     57
11  1948    38      27      19     84
12  1952    40      19      17     76
13  1956    32      25      17     74
14  1960    34      21      16     71
15  1964    36      26      28     90
16  1968    45      28      34    107
17  1972    33      31      30     94
18  1976    34      35      25     94
19  1984    82      61      30    173
20  1988    36      31      27     94
21  1992    37      34      37    108
22  1996    44      32      25    101
23  2000    36      24      31     91
24  2004    36      39      26    101
25  2008    36      39      35    110
26  2012    46      28      29    103
27  2016    46      37      38    121
In [96]:
medal_df[(medal_df['Year'] == 2016) & (medal_df['region']  == 'India')]
Out[96]:
Name Sex Age Height Weight Team NOC Games Year City Sport Event Medal region notes Bronze Gold Silver
1015 Sharath Kamal Achanta M 34.0 186.0 85.0 India IND 2016 Summer 2016 Rio de Janeiro Table Tennis Table Tennis Men's Singles NaN India NaN 0 0 0
7065 Seema Antil F 33.0 182.0 92.0 India IND 2016 Summer 2016 Rio de Janeiro Athletics Athletics Women's Discus Throw NaN India NaN 0 0 0
8713 Aditi Ashok F 18.0 173.0 57.0 India IND 2016 Summer 2016 Rio de Janeiro Golf Golf Women's Individual NaN India NaN 0 0 0
9202 Manu Attri M 23.0 172.0 73.0 India IND 2016 Summer 2016 Rio de Janeiro Badminton Badminton Men's Doubles NaN India NaN 0 0 0
10070 Lalita Shivaji Babar F 27.0 166.0 50.0 India IND 2016 Summer 2016 Rio de Janeiro Athletics Athletics Women's 3,000 metres Steeplechase NaN India NaN 0 0 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
182260 Sathish Kumar Sivalingam M 24.0 175.0 77.0 India IND 2016 Summer 2016 Rio de Janeiro Weightlifting Weightlifting Men's Middleweight NaN India NaN 0 0 0
195568 Shiva Thapa M 22.0 169.0 56.0 India IND 2016 Summer 2016 Rio de Janeiro Boxing Boxing Men's Bantamweight NaN India NaN 0 0 0
198042 Sandeep Tomar M 25.0 168.0 61.0 India IND 2016 Summer 2016 Rio de Janeiro Wrestling Wrestling Men's Featherweight, Freestyle NaN India NaN 0 0 0
216942 Vikas Krishan Yadav M 24.0 177.0 69.0 India IND 2016 Summer 2016 Rio de Janeiro Boxing Boxing Men's Middleweight NaN India NaN 0 0 0
216985 Mohammad Anas Yahiya M 21.0 177.0 69.0 India IND 2016 Summer 2016 Rio de Janeiro Athletics Athletics Men's 400 metres NaN India NaN 0 0 0

68 rows × 18 columns

OVERALL ANALYSIS¶

1. N0.OF EDITIONS¶

In [97]:
df.head()
Out[97]:
Name Sex Age Height Weight Team NOC Games Year City Sport Event Medal region notes Bronze Gold Silver
0 A Dijiang M 24.0 180.0 80.0 China CHN 1992 Summer 1992 Barcelona Basketball Basketball Men's Basketball NaN China NaN 0 0 0
1 A Lamusi M 23.0 170.0 60.0 China CHN 2012 Summer 2012 London Judo Judo Men's Extra-Lightweight NaN China NaN 0 0 0
2 Gunnar Nielsen Aaby M 24.0 NaN NaN Denmark DEN 1920 Summer 1920 Antwerpen Football Football Men's Football NaN Denmark NaN 0 0 0
3 Edgar Lindenau Aabye M 34.0 NaN NaN Denmark/Sweden DEN 1900 Summer 1900 Paris Tug-Of-War Tug-Of-War Men's Tug-Of-War Gold Denmark NaN 0 1 0
4 Cornelia "Cor" Aalten (-Strannood) F 18.0 168.0 NaN Netherlands NED 1932 Summer 1932 Los Angeles Athletics Athletics Women's 100 metres NaN Netherlands NaN 0 0 0
In [98]:
df['Year'].unique().shape
Out[98]:
(29,)
  • Till 2016 There 28 time olympics was conducted, But as per our data its showing 29.Because we all know that olympics conducted every 4 years of interval but after 1904 its next sceducle year is 1906.But IOC not recognised 1906 event as a olympics game.

2.NO OF CITIES¶

In [99]:
df['City'].unique().shape
Out[99]:
(23,)
  • Till 2016, 23 cities host summer olympics

3.No of events/sports¶

In [100]:
df['Sport'].unique().shape
Out[100]:
(52,)
In [101]:
df['Event'].unique().shape
Out[101]:
(651,)
  • Till 2016, 651 Event conducted in 52 Sports

4.NO OF ATHELETES¶

In [102]:
df['Name'].unique().shape
Out[102]:
(116122,)
  • Total 116122 Atheletes Participate

5.No OF PARTICIPATING NATIONS¶

In [103]:
df['region'].unique().shape
Out[103]:
(206,)
  • Total 206 Nation Participate in summer olympics till 2016

Nation over Time¶

In [104]:
df.head()
Out[104]:
Name Sex Age Height Weight Team NOC Games Year City Sport Event Medal region notes Bronze Gold Silver
0 A Dijiang M 24.0 180.0 80.0 China CHN 1992 Summer 1992 Barcelona Basketball Basketball Men's Basketball NaN China NaN 0 0 0
1 A Lamusi M 23.0 170.0 60.0 China CHN 2012 Summer 2012 London Judo Judo Men's Extra-Lightweight NaN China NaN 0 0 0
2 Gunnar Nielsen Aaby M 24.0 NaN NaN Denmark DEN 1920 Summer 1920 Antwerpen Football Football Men's Football NaN Denmark NaN 0 0 0
3 Edgar Lindenau Aabye M 34.0 NaN NaN Denmark/Sweden DEN 1900 Summer 1900 Paris Tug-Of-War Tug-Of-War Men's Tug-Of-War Gold Denmark NaN 0 1 0
4 Cornelia "Cor" Aalten (-Strannood) F 18.0 168.0 NaN Netherlands NED 1932 Summer 1932 Los Angeles Athletics Athletics Women's 100 metres NaN Netherlands NaN 0 0 0
In [105]:
nations_over_time=df.drop_duplicates(['Year','region'])['Year'].value_counts().reset_index().sort_values('Year')
In [106]:
nations_over_time
Out[106]:
Year count
28 1896 12
22 1900 31
27 1904 14
26 1906 20
25 1908 22
24 1912 29
23 1920 29
21 1924 45
20 1928 46
19 1932 47
18 1936 49
17 1948 59
16 1952 67
15 1956 71
13 1960 83
11 1964 93
10 1968 111
9 1972 120
12 1976 91
14 1980 80
8 1984 139
7 1988 156
6 1992 168
5 1996 196
4 2000 199
3 2004 200
2 2008 202
1 2012 203
0 2016 204
In [107]:
nations_over_time.rename(columns={'Year':'Edition','count':'No of Countries'},inplace=True)
In [108]:
import plotly.express as px
In [109]:
fig = px.line(nations_over_time,x='Edition',y='No of Countries')
fig.show()
  • In 1896 its only 12 nation participate but in 2016 it jump upto 204

No.of Events over time (Every Sport)¶

In [110]:
import seaborn as sns
import matplotlib.pyplot as plt
In [111]:
x = df.drop_duplicates(['Year','Sport','Event'])
In [112]:
plt.figure(figsize=(20,20))
sns.heatmap(x.pivot_table(index='Sport',columns='Year',values='Event',aggfunc='count').fillna(0).astype('int'),annot=True)
Out[112]:
<Axes: xlabel='Year', ylabel='Sport'>
No description has been provided for this image
  • In Athletics, the number of Events are increasing from 12 in 1896 to 47 in 2016
  • Baseball was played 5 times in Summer Olympics history from 1992 to 2008 with 1 event each time
  • Boxing is continuously increasing with 13 events in 2016
  • Cricket was played only one time in the year 1900
  • Cycling has 18 events each in the last 5 Olympics
  • Except in 1896 and 1932 football was played in every Olympics with 2 events in the last 6 Olympics
  • Golf played 3 times i.e 1900,1904 and 2016 with 2 events in each Olympics
  • Gymnastics played in every Olympics, in the last 15 Olympics gymnastics was played in 14 different category
  • Except for 1st Olympics Rowing was played in every Olympics
  • Swimming played in every Olympics with 34 categories in 2016
  • Volleyball was added in 1964 with 2 events and continuing the same
  • Weightlifting and Wrestling two famous sports in Olympics history with 15 and 18 event in 2016 respectively

Most successful Athletes¶

In [113]:
def most_successful(df,sport):
    temp_df = df.dropna(subset=['Medal'])

    if sport != 'Overall':
        temp_df = temp_df[temp_df]['Sport'] == sport

    return temp_df['Name'].value_counts().reset_index()                                      
In [114]:
most_successful(df,'Overall')
Out[114]:
Name count
0 Michael Fred Phelps, II 28
1 Larysa Semenivna Latynina (Diriy-) 18
2 Nikolay Yefimovich Andrianov 15
3 Borys Anfiyanovych Shakhlin 13
4 Takashi Ono 13
... ... ...
24540 Joel Isasi Gonzlez 1
24541 Florena Andreea Isrescu 1
24542 Ryoji Isaoka 1
24543 Preben Isaksson 1
24544 rpd Lengyel 1

24545 rows × 2 columns

In [115]:
def most_successful(df, sport):
    # Drop rows where 'Medal' is NaN
    temp_df = df.dropna(subset=['Medal'])
    
    # Filter by sport if it's not 'Overall'
    if sport != 'Overall':
        temp_df = temp_df[temp_df['Sport'] == sport]
    
    # Get the top 15 names by medal count
    top_names = temp_df['Name'].value_counts().reset_index()
    top_names.columns = ['Name', 'count']
    
    # Merge with relevant columns from the original DataFrame
    result = top_names.merge(df[['Name', 'Sport', 'region']], on='Name', how='left').drop_duplicates()
    
    # Ensure to return relevant columns
    return result[['Name', 'count', 'Sport', 'region']]
In [116]:
most_successful(df,'Overall').head(15)
Out[116]:
Name count Sport region
0 Michael Fred Phelps, II 28 Swimming USA
30 Larysa Semenivna Latynina (Diriy-) 18 Gymnastics Russia
49 Nikolay Yefimovich Andrianov 15 Gymnastics Russia
73 Borys Anfiyanovych Shakhlin 13 Gymnastics Russia
97 Takashi Ono 13 Gymnastics Japan
129 Takashi Ono 13 Judo Japan
130 Edoardo Mangiarotti 13 Fencing Italy
144 Dara Grace Torres (-Hoffman, -Minas) 12 Swimming USA
157 Aleksey Yuryevich Nemov 12 Gymnastics Russia
178 Jennifer Elisabeth "Jenny" Thompson (-Cumpelik) 12 Swimming USA
195 Birgit Fischer-Schmidt 12 Canoeing Germany
208 Ryan Steven Lochte 12 Swimming USA
222 Paavo Johannes Nurmi 12 Athletics Finland
234 Sawao Kato 12 Gymnastics Japan
258 Natalie Anne Coughlin (-Hall) 12 Swimming USA
  • In the history of summer olympics Michael Fred Phelps from USA win 28 medal in Swimming.

Country wise¶

1. Countrywise medal tally per year(line plot)¶

In [117]:
temp_df=df.dropna(subset=['Medal'])
temp_df.drop_duplicates(subset=['Team','NOC','Games','Year','City','Sport','Event','Medal'],inplace=True)
In [118]:
new_df = temp_df[temp_df['region'] == 'India']
final_df=new_df.groupby('Year').count()['Medal'].reset_index()
In [119]:
fig = px.line(final_df,x='Year',y='Medal')
fig.show()
  • India win 6 medal in the 2012 olympics

2.What country are good at heatmap¶

In [120]:
new_df = temp_df[temp_df['region'] == 'Russia']
plt.figure(figsize=(20,20))
sns.heatmap(new_df.pivot_table(index='Sport',columns='Year',values='Medal',aggfunc='count').fillna(0),annot=True) 
Out[120]:
<Axes: xlabel='Year', ylabel='Sport'>
No description has been provided for this image
  • Russia won 41 medals in Athletics in the year 1980, which is the best achievement in a particular sport in a particular year in Summer Olympics history for Russia
  • For Russia Athletics, Gymnastics and wrestling are the top three sports
  • The USA won 65 medals in Athletics in the year 1904, which is the best achievement in a particular sport in a particular year in Summer Olympics history for the USA
  • For the USA Atheletics and Swiming are the top two sports
  • For India Hockey is the best sports

3. Most succesful Athletes(Top 10)¶

In [121]:
def most_successful(df, country):
    # Drop rows where 'Medal' is NaN
    temp_df = df.dropna(subset=['Medal'])
    
    # Filter by sport if it's not 'Overall'
    if country != 'Overall':
        temp_df = temp_df[temp_df['region'] == country]
    
    # Get the top 15 names by medal count
    top_names = temp_df['Name'].value_counts().reset_index()
    top_names.columns = ['Name', 'count']
    
    # Merge with relevant columns from the original DataFrame
    result = top_names.merge(df[['Name', 'Sport',]], on='Name', how='left').drop_duplicates()
    
    # Ensure to return relevant columns
    return result[['Name', 'count', 'Sport', ]]
In [122]:
most_successful(df,'China').head(10)
Out[122]:
Name count Sport
0 Yang Wei 7 Badminton
3 Yang Wei 7 Gymnastics
26 Wu Minxia 7 Diving
33 Wang Yifu 6 Shooting
43 Li Ning 6 Gymnastics
59 Zou Kai 6 Gymnastics
67 Sun Yang 6 Swimming
77 Wang Hao 6 Table Tennis
83 Wang Hao 6 Athletics
84 Wang Hao 6 Diving
  • For India, Udham Singh Kular and Leslie Walter Claudius are the two best Athelet with 4 medals each in Hockey
  • For Russia, Larysa Semenivna Latyninar is the best Athelet with 18 medals in Gymnastics
  • For Italy, Edoardo Mangiarotti is the best player with 13 medals in Fencing
  • For Japan, Takashi Ono is the best player with 13 medals in Gymnastics
  • For Germany, Birgit Fischer-Schmidt is the best player with 12 medals in Canoeing
  • For the UK, Henry Taylor with 8 medals in swimming
  • For Finland, Paavo Johannes Nurmi with 12 medals in Athletics
  • For China, Yang Wei won 7 medals each in Badminton and Gymnastics

Athlete Wise Analysis¶

1. Age vs Medal¶

In [123]:
import plotly.figure_factory as ff
In [124]:
athlete_df=df.drop_duplicates(subset=['Name','region'])
In [125]:
x1 = athlete_df['Age'].dropna()
x2 = athlete_df[athlete_df['Medal'] == 'Gold']['Age'].dropna()
x3 = athlete_df[athlete_df['Medal'] == 'Silver']['Age'].dropna()
x4 = athlete_df[athlete_df['Medal'] == 'Bronze']['Age'].dropna()
In [126]:
fig = ff.create_distplot([x1,x2,x3,x4],['Ovreall Age','Gold Medalist','Silver Medalist','Bronze Medalist'],show_hist=False,show_rug=False)
fig.show()
  • When an Athelets age is 10 to 17 and the athlete is entering a particular Sport, Then his chance to win a Silver Medal is more than that of Gold Medal
  • When an Athelet age is 21 to 25 Then his chance to win a Gold Medal is more than that of Silver Medal
  • When an Athelet age is between 26 to 32 Then his chance to win a Silver Medal is more than that of Gold Medal
  • When an Athelet age is 33 to 36 Then his chance to win a Gold Medal is more than that of Silver Medal
  • When an Athletes age is 22-23 his probability to win a medal is maximum

2.Sport vs Age¶

In [127]:
famous_sports = ['Basketball', 'Judo', 'Football', 'Tug-Of-War', 'Athletics',
       'Swimming', 'Badminton', 'Sailing', 'Gymnastics',
       'Art Competitions', 'Handball', 'Weightlifting', 'Wrestling',
       'Water Polo', 'Hockey', 'Rowing', 'Fencing', 
       'Shooting', 'Boxing', 'Taekwondo', 'Cycling', 'Diving', 'Canoeing',
       'Tennis',  'Golf', 'Softball', 'Archery',
       'Volleyball', 'Synchronized Swimming', 'Table Tennis', 'Baseball',
       'Rhythmic Gymnastics', 'Rugby Sevens',
       'Beach Volleyball', 'Triathlon', 'Rugby',  'Polo',
        'Ice Hockey']
In [128]:
x = []
name = []
for sport in famous_sports:
    temp_df = athlete_df[athlete_df['Sport'] == sport]
    x.append(temp_df[temp_df['Medal']=='Gold']['Age'].dropna())
    name.append(sport)
In [129]:
fig =ff.create_distplot(x,name,show_hist=False,show_rug=False)
fig.show()
  • For Rhythmic Gymnastics, when an athlete is in age between 15-and 20 the chance to win a gold medal is maximum
  • For Beach Volleyball, when an athlete is aged between 25 and 30 the chance to win a gold medal is maximum
  • For an Art competition one athlete can play up to 59 years
  • For Polo chance for wining a gold medal is 33 to 36
  • For Cycling 20 - 23 is the best years
  • One Athelitc can play Shooting up to 60 years
  • One Athelitc can play Water Polo for up to 40 years and its peak year is 22-24
  • As a comparison between Polo and Water polo, age is a vital factor for water polo

Height vs Weight¶

In [130]:
athlete_df['Medal'].fillna('No Medal',inplace=True)
In [131]:
plt.figure(figsize=(10,10))
temp_df = athlete_df[athlete_df['Sport'] == 'Weightlifting']  # Compare with a string, not a list
sns.scatterplot(x=temp_df['Weight'], y=temp_df['Height'], hue=temp_df['Medal'], style=temp_df['Sex'], s=100)
plt.show()  # Optional: To display the plot
No description has been provided for this image
  • In Wrestling Males is more than female
  • In Weightlifting female is more than that of male

Men vs Women Participation¶

In [132]:
men = athlete_df [athlete_df ['Sex'] == 'M'].groupby('Year').count()['Name'].reset_index()
women = athlete_df[athlete_df['Sex'] == 'F'].groupby('Year').count()['Name'].reset_index()
In [133]:
final = men.merge(women,on='Year',how='left')
final.rename(columns={'Name_x':'Male','Name_y':'Female'},inplace = True)
In [134]:
final.fillna(0,inplace=True)
In [135]:
fig = px.line(final,x='Year',y=['Male','Female'])
fig.show()
  • When Summer Olympics started, Female participation is 0, But as this event is going on female participation is continously incresing and first female paricipation is in 1900 and it is 23 number

conclusion¶

  • Till 2016, the Summer Olympics was played 28 times starting from 1896 to 2016
  • Till 2016, 23 cities hosted the Summer Olympics.
  • Till 2016, 651 Event conducted in 52 Sports
  • A total of 116122 Athletes Participate
  • A total of 206 nations participated in the Summer Olympics till 2016
  • In 1896 only 12 nations participated but in 2016 it jumped up to 204
  • In Athletics, the number of Events are increasing from 12 in 1896 to 47 in 2016
  • Baseball was played 5 times in Summer Olympics history from 1992 to 2008 with 1 event each time
  • Boxing is continuously increasing with 13 events in 2016
  • Cricket was played only one time in the year 1900
  • Cycling has 18 events each in the last 5 Olympics
  • Except in 1896 and 1932 football was played in every Olympics with 2 events in the last 6 Olympics
  • Golf played 3 times i.e 1900,1904 and 2016 with 2 events in each Olympics
  • Gymnastics is played in every Olympics, in the last 15 Olympics gymnastics was played in 14 different category
  • Except for 1st Olympics Rowing was played in every Olympics
  • Swimming played in every Olympics with 34 categories in 2016
  • Volleyball was added in 1964 with 2 events and continued the same till 2016
  • Weightlifting and Wrestling are two famous sports in Olympic history with 15 and 18 events in 2016 respectively
  • In the history of the Summer Olympics Michael Fred Phelps from the USA won 28 medals in Swimming and is the most successful player.
  • Russia won 41 medals in Athletics in the year 1980, which is the best achievement in a particular sport in a particular year in Summer Olympics history for Russia
  • For Russia Athletics, Gymnastics and wrestling are the top three sports
  • The USA won 65 medals in Athletics in the year 1904, which is the best achievement in a particular sport in a particular year in Summer Olympics history for the USA
  • For the USA Athletics and Swiming are the top two sports
  • For India Hockey is the best sports
  • For India, Udham Singh Kular and Leslie Walter Claudius are the two best Athelet with 4 medals each in Hockey
  • For Russia, Larysa Semenivna Latyninar is the best Athelet with 18 medals in Gymnastics
  • For Italy, Edoardo Mangiarotti is the best player with 13 medals in Fencing
  • For Japan, Takashi Ono is the best player with 13 medals in Gymnastics
  • For Germany, Birgit Fischer-Schmidt is the best player with 12 medals in Canoeing
  • For the UK, Henry Taylor with 8 medals in swimming
  • For Finland, Paavo Johannes Nurmi with 12 medals in Athletics
  • For China, Yang Wei won 7 medals each in Badminton and Gymnastics
  • When an Athelets age is 10 to 17 and the athlete is entering a particular Sport, Then his chance to win a Silver Medal is more than that of a Gold Medal
  • When an Athelet age is 21 to 25 Then his chance to win a Gold Medal is more than that of a Silver Medal
  • When an Athelet age is between 26 to 32 Then his chance to win a Silver Medal is more than that of a Gold Medal
  • When an Athelet age is 33 to 36 Then his chance to win a Gold Medal is more than that of a Silver Medal
  • When an Athletes age is 22-23 his probability to win a medal is maximum
  • For Rhythmic Gymnastics, when an athlete is in age between 15-and 20 the chance to win a gold medal is maximum
  • For Beach Volleyball, when an athlete is aged between 25 and 30 the chance to win a gold medal is maximum
  • For an Art competition, one athlete can play for up to 59 years
  • For Polo, the chance of winning a gold medal is 33 to 36
  • For Cycling 20 - 23 is the best years
  • One Athelitc can play Shooting up to 60 years
  • One Athelitc can play Water Polo for up to 40 years and its peak year is 22-24
  • As a comparison between Polo and Water polo, age is a vital factor for water polo
  • In Wrestling Males are more than female
  • In Weightlifting, females are more than male
  • When the Summer Olympics started, Female participation was 0, But as this event has been going on female participation has continuously increased and first female participation was in 1900 and it is 23 number

END¶